Learning Optimal Parameter Values in Dynamic Environment: An Experiment with Softmax Reinforcement Learning Algorithm
Author
Abstract
1. Introduction

Many learning and heuristic search algorithms require tuning of parameters to achieve optimum performance. In stationary and deterministic problem domains this is usually achieved through off-line sensitivity analysis. However, this method breaks down in non-stationary and non-deterministic environments, where the optimal set of values for the parameters keeps changing over time. What is needed in such scenarios is a meta-learning (ML) mechanism that can learn the optimal set of parameters on-line while the learning algorithm is trying to learn its target concept. In this paper we present a simple meta-learning algorithm to learn the temperature parameter of the Softmax reinforcement-learning (RL) algorithm.

We test the effectiveness of this meta-learning algorithm in two domains. The first is the classic reinforcement learning problem known as the k-armed bandit problem. The second domain is a stylized problem in e-procurement. It involves a context of strategic interaction consisting of homogeneous sellers of a single raw material or component vying for business from a single buyer. The sellers are modeled as artificial agents that learn increasingly effective bidding strategies.

We model non-stationarity in the first domain, the k-armed bandit problem, by periodically switching the reward distributions. The second domain is in effect a non-stationary and non-deterministic learning problem, since it is a game of strategic interaction involving multiple agents. We model one of the sellers as using the Softmax RL algorithm and consider the other seller to be using one of several different learning algorithms. We show that the best value of the temperature parameter for the Softmax agent varies depending on the learning algorithm used by the other seller. In both domains we show the improvement in performance brought about by the use of meta-learning.
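The paper summarized above does not include code, so the sketch below only illustrates the two ingredients this introduction names: softmax (Boltzmann) action selection governed by a temperature parameter, and a k-armed bandit made non-stationary by periodically switching its reward distributions. All names and settings here (softmax_probs, run_softmax_bandit, switch_every, the constant step size) are illustrative assumptions, not the authors' implementation, and the meta-learning algorithm itself is not reproduced.

```python
import numpy as np

def softmax_probs(q_values, temperature):
    """Boltzmann/softmax action probabilities for a vector of value estimates."""
    # Subtract the max before exponentiating for numerical stability.
    z = (q_values - np.max(q_values)) / temperature
    e = np.exp(z)
    return e / e.sum()

def run_softmax_bandit(temperature, n_arms=10, horizon=5000,
                       switch_every=1000, step_size=0.1, seed=0):
    """Softmax action selection on a k-armed bandit whose reward means are
    redrawn every `switch_every` steps (a simple non-stationary setup)."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, n_arms)
    q = np.zeros(n_arms)              # incrementally estimated action values
    total_reward = 0.0
    for t in range(horizon):
        if t > 0 and t % switch_every == 0:
            true_means = rng.normal(0.0, 1.0, n_arms)   # reward distributions switch
        probs = softmax_probs(q, temperature)
        a = rng.choice(n_arms, p=probs)
        r = rng.normal(true_means[a], 1.0)
        q[a] += step_size * (r - q[a])  # constant step size so estimates track the drift
        total_reward += r
    return total_reward

# A low temperature exploits aggressively; a high one explores more.
for tau in (0.05, 0.2, 1.0):
    print(tau, run_softmax_bandit(tau))
```

Sweeping a handful of fixed temperatures, as the final loop does, is essentially the off-line sensitivity analysis the introduction argues breaks down in non-stationary settings; the paper's point is that the temperature should instead be adapted on-line while learning.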
Similar Articles
Dynamic Obstacle Avoidance by Distributed Algorithm based on Reinforcement Learning (RESEARCH NOTE)
In this paper we focus on the application of reinforcement learning to obstacle avoidance in dynamic environments in wireless sensor networks. A distributed algorithm based on reinforcement learning is developed for sensor networks to guide a mobile robot through dynamic obstacles. The sensor network models the danger of the area under coverage as obstacles, and has the property of adoption o...
Dependency of Parameter Values in Reinforcement Learning for Navigation of a Mobile Robot on the Environment
Reinforcement learning is suitable for navigation of a mobile robot due to its learning ability without supervised information. Reinforcement learning, however, has difficulties. One is its slow learning, and the other is the necessity of specifying its parameter values without prior information. We proposed to introduce sensory signals into reinforcement learning to improve its learning perfor...
Task Allocation through Vacancy Chains: Action Selection in Multi-Robot Learning
We present an adaptive multi-robot task allocation algorithm based on vacancy chains, a resource distribution process common in animal and human societies. The algorithm uses individual reinforcement learning of task utilities and relies on the specializing abilities of the members of the group to promote dedicated optimal allocation patterns. We demonstrate through experiments in simulation, t...
Operation Scheduling of MGs Based on Deep Reinforcement Learning Algorithm
In this paper, an approach based on Deep Reinforcement Learning (DRL) is proposed for the operation scheduling of Microgrids (MGs), including Distributed Energy Resources (DERs) and Energy Storage Systems (ESSs). Due to the dynamic characteristic of the problem, it is first formulated as a Markov Decision Process (MDP). Next, the Deep Deterministic Policy Gradient (DDPG) algorithm is presented t...
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values satisfy a strong consistency property with optimal entropy regularized policy probabilities along any action sequence...
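As context for the terminology in that last abstract, the "softmax consistency" it refers to is usually stated as follows in the entropy-regularized RL literature; the relations below are a standard formulation (temperature \tau, discount \gamma, deterministic transition s -> s' under action a, written here for brevity), not a transcription from that paper:

```latex
% Softmax (log-sum-exp) backup for the optimal entropy-regularized value:
V^{*}(s) \;=\; \tau \log \sum_{a} \exp\!\Big(\tfrac{r(s,a) + \gamma V^{*}(s')}{\tau}\Big)
% The matching Boltzmann-optimal policy:
\pi^{*}(a \mid s) \;=\; \exp\!\Big(\tfrac{r(s,a) + \gamma V^{*}(s') - V^{*}(s)}{\tau}\Big)
% One-step path consistency, which follows for every action a:
V^{*}(s) - \gamma V^{*}(s') \;=\; r(s,a) - \tau \log \pi^{*}(a \mid s)
```

The abstract's claim is that the third relation, the one-step consistency between softmax values and the entropy-regularized optimal policy, extends along any action sequence.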